Entities

Willis (David)

Publications (11)

Willis, David [princip. invest.], and Marieke Meelen [princip. invest.], PARSHCWL: The Parsed Historical Corpus of the Welsh Language, Online, 2023–present. URL: <https://www.celticstudies.net/parshcwl/>.

abstract:

The Parsed Historical Corpus of the Welsh Language (PARSHCWL) is a project to create an annotated corpus of Middle and Early Modern Welsh texts. The texts in various formats (plain text files, Part-of-Speech tagged and parsed files) will be made available in the course of the project on this website. In addition, detailed annotation manuals and guidelines will be made available here to enable any researcher working with Welsh (historical) texts to add morphosyntactic information to their texts, adding to a growing corpus of searchable historical Welsh materials.

Meelen, Marieke, and David Willis, “Towards a historical treebank of Middle and Modern Welsh: syntactic parsing”, Journal of Historical Syntax 6:5 (2022): 1–32.

Willis, David, “The development of realis conditional clauses in Welsh”, in: Erich Poppe, Simon Rodway, and Jenny Rowland (eds), Celts, Gaels, and Britons: studies in language and literature from antiquity to the middle ages in honour of Patrick Sims-Williams, Turnhout: Brepols, 2022. 289–310.

Darling, Mark, Marieke Meelen, and David Willis, “Towards coreference resolution for Early Irish”, in: Theodorus Fransen, William Lamb, and Delyth Prys (eds), Proceedings of the 4th Celtic Language Technology Workshop at LREC2022 (CLTW 4), Marseille: European Language Resources Association (ELRA), 2022. 85–93.

Meelen, Marieke, and David Willis, “Creating annotated corpora for historical languages”, Journal of Historical Syntax 6:4 (2022): 1–5.

Meelen, Marieke, and David Willis, “Towards a historical treebank of Middle and Early Modern Welsh, part I: workflow and POS tagging”, Journal of Celtic Linguistics 22 (2021): 125–154.

abstract:

This article introduces the working methods of the Parsed Historical Corpus of the Welsh Language (PARSHCWL). The corpus is designed to provide researchers with a tool for automatic exhaustive extraction of instances of grammatical structures from Middle and Modern Welsh texts in a way comparable to similar tools that already exist for various European languages. The major features of the corpus are outlined, along with the overall architecture of the workflow needed for a team of researchers to produce it. In this paper, the first two stages of the process, namely pre-processing of texts and automated part-of-speech (POS) tagging, are discussed in some detail, focusing in particular on major issues involved in defining word boundaries and in defining a robust and useful tagset.

Willis, David, “Old and Middle Welsh”, in: Martin J. Ball, and Nicole Müller (eds), The Celtic languages, 2nd ed., London, New York: Routledge, 2009. 117–160.

Willis, David, “Specifier-to-head reanalyses in the complementizer domain: evidence from Welsh”, Transactions of the Philological Society 105:3 (November, 2007): 432–480.

Willis, David, “Negation in Middle Welsh”, Studia Celtica 40 (2006): 63–88.

Willis, David, “Lexical diffusion in Middle Welsh: the distribution of /j/ in the law texts”, Journal of Celtic Linguistics 9 (2005): 105–133.

abstract:

This article looks at variation in the distribution of /j/ in post-tonic syllables in Middle Welsh. It extends previous studies by looking at variation at the level of the individual lexical item, using data from a stylistically and lexically relatively homogeneous group of law manuscripts from both north and south Wales. Many items show no variation, appearing either with /j/ or without /j/ in all texts. Variable items show different patterns of distribution: for some items, /j/-full forms are restricted to northern texts, and even there compete with /j/-less forms; for other items, the /j/-full forms dominant in the northern texts are found alongside /j/-less forms even in the south. With frequent items, it seems clear that the overall patterns closely resemble those found with cases of lexical diffusion of linguistic innovations. In addition to documenting the patterns of variation, this article makes some proposals as to how they may have arisen. It is suggested that, in the items investigated closely here (plural suffixes and synchronically monomorphemic items), two processes play the major role: a sound change deleting /j/ in the onset of post-tonic syllables, which diffuses south-to-north; and analogical extension of /j/ into the -eu and -oed plural suffixes, restricted to northern varieties.

Mittendorf, Ingo, and David Willis, Corpws hanesyddol yr iaith Gymraeg 1500–1850 = A historical corpus of the Welsh language, 1500–1850, Online: University of Cambridge, 2004–. URL: <https://www.celticstudies.net>.

abstract:

The Historical Corpus of the Welsh Language 1500–1850 is a collection of Welsh texts from the period 1500–1850 in an electronic format. It is the result of a project to encode Welsh texts of the period funded by the Arts and Humanities Research Board (AHRB Resource Enhancement Award RE11900) in the Department of Linguistics at the University of Cambridge between 2001 and 2004. The project's Principal Investigator was David Willis, while Ingo Mittendorf was the project's Research Associate. The aim of the project was to begin to provide an electronically searchable resource for use in linguistic, literary and historical research, of a kind similar to existing corpora already available for languages such as English, French, German and Irish. The Cambridge project dealt with the early modern Welsh period. Other projects at the University of Wales have provided or are providing similar materials for earlier periods. Although the project came to an end in 2004, it is hoped that resources will become available to allow future extension of the corpus.

The corpus is a planned corpus, and aims to reflect the rich diversity of the texts attested in Welsh during the period 1500–1850 by including texts and samples of texts from different stylistic levels and of varying geographical provenance. A number of the texts included are not available in adequate modern editions or are available only in modernised form, hence the corpus also provides access to a number of texts in an easily available form for the first time. It is hoped that this will encourage further linguistic, literary and historical research on these texts.

The corpus is encoded using Extensible Markup Language (XML) in a format that conforms to the standards of the Text Encoding Initiative (TEI). This should ensure its long-term preservation, and also allows flexibility in the way the texts of the corpus can be displayed and used. The corpus files can be viewed online here, and are also available for download here in a number of formats: as plain XML files; as viewable HTML documents in two formats (diplomatic and edited); as corpus files designed for use with the Concordance software package; and as web-based indexes and concordances. Although the corpus contains no grammatical tagging, the XML files contain some encoding designed to facilitate the usefulness of the corpus as a source for linguistic research. This concerns mainly spelling and graphical variation. Original spelling is maintained, but tagging for scribal errors and extreme orthographic variation is included, and is used in the indexes and concordances. Other editorial conventions are documented here.

The corpus is arranged into different groups of text types in order to represent the stylistic diversity of the Welsh language, while allowing for differences in the specific range of text types actually available at different periods. The texts therefore include drama, personal letters, ballads, political (didactic) prose, scripture, historical narrative, narrative prose, and religious prose. For each text a representative sample of approximately 15,000 words is included. With texts whose total length is less than around 20,000 words, and also in the case of dramatic texts (the interludes), we have generally chosen to include the entire text. Overall the corpus contains around 420,000 words from 30 texts.

(source: Website (8 April 2018))
